Overview

In this assignment, we examine a sample of clients of an insurance company. Using factors such as age, number of children, and smoking status, we explore how their charges vary. The main finding is that smokers have much higher charges than non-smokers. Charges also tend to rise modestly with age.

  2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.
Rows: 1,338
Columns: 7
$ age      <dbl> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1…
$ sex      <chr> "female", "male", "male", "male", "male", "female", "female",…
$ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74…
$ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0…
$ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", …
$ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes…
$ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,…

There are 7 variables and 1,338 observations.

Region

  3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.

The insurer covers clients in every region in the data, and a plurality of them are in the southeast region.

  4. Create a stacked bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.

BMI

Histogram of BMI

  5. Create a histogram of bmi. Discuss the distribution of the histogram.

Most clients have a bmi between 20 and 40, with a few outliers.

Histogram of Charges

  6. Create a histogram of charges. Discuss the distribution of the histogram.
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1122    4740    9382   13270   16640   63770 

The histogram shows that most clients' charges are below 20000, and many of those are below 10000. The distribution is right-skewed: the mean (13270) is well above the median (9382), and the maximum is 63770.

BMI based on Region

  7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)

The boxplot shows that the southeast region has the highest median bmi. The other regions all have reasonably similar distributions of bmi.

Age and Charges

  8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.

The scatterplot shows a moderate positive relationship between age and charges. For each age, the majority of clients are on the lower end of the charge range.

Charges and Smoking

  9. You should find that it seems “charges” could be classified into several groups. Let’s create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.

This plot shows that smokers have higher charges than non-smokers of the same age, and that no client in the lowest range of charges is a smoker.

  10. Now, create two data frames by subsetting insurance data as follows.
smoker <- insurance[insurance$smoker=="yes"]
nonsmoker <- insurance[insurance$smoker=="no"]

  11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?

I don’t think a straight line summarizes this data well. Few of the points fall on or near the line, so it does not match the data closely. It does, however, show the positive association between age and charges.

  12. Repeat Question 11 using the data frame nonsmoker.

This time, the single straight line more accurately describes the data.

Children vs. No Children

  13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

Next, I would make similar plots of charges split by whether the client has children, or by region and bmi.

  14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)

  15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.
---
title: "Assignment 7"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    source_code: embed
---



```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
insurance <- read_csv("insurance.csv")
```

Overview
===

Column {data-width=500}
---

In this assignment, we examine a sample of clients of an insurance company. Using factors such as age, number of children, and smoking status, we explore how their charges vary. The main finding is that smokers have much higher charges than non-smokers. Charges also tend to rise modestly with age.


Column {data-width=500}
---

2. Get a glimpse of the data and indicate the number of observations and the number of variables in the data.
```{r}
glimpse(insurance)
attach(insurance)
```
There are 7 variables and 1,338 observations.

Region
===

Column {data-width=500}
---

3. Create a bar plot of region. Use a few sentences to summarize your finding based on the plot.
```{r}
ggplot(insurance, aes(x = region)) +
  geom_bar()
```

The insurer covers clients in every region in the data, and a plurality of them are in the southeast region.

Column {data-width=500}
---

4. Create a stacked bar plot such that region is on the x axis and each bar shows the distribution of smoker in that region. You should make sure that your y axis shows percents.
```{r}
# Count clients per region/smoker pair, then convert to within-region percentages
smoker_distribution <- insurance %>%
  count(region, smoker, name = "count") %>%
  group_by(region) %>%
  mutate(percent = count / sum(count) * 100) %>%
  ungroup()

# Each bar stacks the "yes" and "no" shares, so every bar sums to 100
ggplot(smoker_distribution, aes(x = region, y = percent, fill = smoker)) +
  geom_col(position = "stack") +
  labs(title = "Smoker Status by Region",
       x = "Region",
       y = "Percent of Clients",
       fill = "Smoker") +
  scale_fill_manual(values = c("yes" = "navy", "no" = "skyblue"))
```
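
As a cross-check on the percentages behind a stacked bar plot like this one, the within-region shares can also be tabulated in base R. The sketch below uses a small hypothetical data frame `toy` (made-up values, not the insurance data) and assumes only `table()` and `prop.table()`.

```r
# Hypothetical toy data standing in for the insurance data (values are made up)
toy <- data.frame(
  region = c(rep("northeast", 4), rep("southeast", 4)),
  smoker = c("yes", "no", "no", "no", "yes", "yes", "no", "no")
)

# Cross-tabulate region by smoker status
counts <- table(toy$region, toy$smoker)

# margin = 1 normalizes within each row (region), so each row sums to 100
pct <- prop.table(counts, margin = 1) * 100
print(round(pct, 1))
```

Normalizing by row rather than over the whole table is what makes the bars comparable across regions of different sizes.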

BMI
===

Column {data-width=500}
---

## Histogram of BMI

5. Create a histogram of bmi. Discuss the distribution of the histogram.
```{r}
ggplot(insurance, aes(x = bmi)) +
  geom_histogram(bins = 30)  # 30 is ggplot2's default, made explicit
```

Most clients have a bmi between 20 and 40, with a few outliers.
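
A statement such as "most clients fall between 20 and 40" can be backed by a number. The following is a minimal sketch on a hypothetical vector `bmi` (made-up values, not the insurance data); the mean of a logical vector gives the share of `TRUE` values.

```r
# Hypothetical bmi values (made up for illustration)
bmi <- c(18.5, 22.7, 25.7, 27.9, 28.9, 33.0, 33.4, 33.8, 41.2, 47.5)

# TRUE where the value lies in the 20 to 40 range
in_range <- bmi >= 20 & bmi <= 40

# The mean of a logical vector is the proportion of TRUE values
share <- mean(in_range)
share
```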

Column {data-width=500}
---

## Histogram of Charges

6. Create a histogram of charges. Discuss the distribution of the histogram.
```{r}
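# Reference the column by name so the summary does not depend on attach()
summary(insurance$charges)
ggplot(insurance, aes(x = charges)) +
  geom_histogram(bins = 30)  # 30 is ggplot2's default, made explicit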
```

The histogram shows that most clients' charges are below 20000, and many of those are below 10000. The distribution is right-skewed: in the summary output, the mean is well above the median.
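
One rough way to confirm a long right tail is to compare the mean with the median. This is only a heuristic, not a formal test, and the sketch uses a hypothetical vector `charges` with made-up values.

```r
# Hypothetical charges (made up): mostly small values plus one very large one
charges <- c(1100, 2500, 4700, 9400, 12000, 16600, 63800)

# A few large values pull the mean up but leave the median almost unchanged
center <- c(mean = mean(charges), median = median(charges))
print(center)

# Heuristic only: a mean above the median suggests right skew
is_right_skewed <- mean(charges) > median(charges)
is_right_skewed
```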

## BMI based on Region

7. Create a boxplot that shows the distribution of bmi based on the region. Discuss what you find based on the boxplot. (Hint: you need to have x and y variables in mapping)
```{r}
ggplot(insurance, aes(x = region, y = bmi, fill = region)) +
  geom_boxplot() +
  labs(title = "Distribution of BMI According to Region",
       x = "Region",
       y = "BMI")
```

The boxplot shows that the southeast region has the highest median bmi. The other regions all have reasonably similar distributions of bmi.
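
The center line of a boxplot marks the median, not the mean, so group medians are the natural numbers to quote alongside it. A minimal sketch with a hypothetical data frame `toy` (made-up values) computes both with `tapply()` for comparison.

```r
# Hypothetical toy data standing in for the insurance data (values are made up)
toy <- data.frame(
  region = rep(c("northwest", "southeast"), each = 5),
  bmi    = c(22, 25, 27, 29, 31, 26, 30, 33, 35, 45)
)

# Median per region: this is what the line inside each box shows
medians <- tapply(toy$bmi, toy$region, median)

# Mean per region: can differ from the median when a group has outliers
means <- tapply(toy$bmi, toy$region, mean)

print(rbind(median = medians, mean = means))

# Region with the highest median
top_region <- names(which.max(medians))
top_region
```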

Age and Charges
===

Column {data-width=1000}
---

8. Create a scatterplot that shows the relationship between age (independent variable) and charges (dependent variable). Comment on the scatterplot.
```{r}
ggplot(insurance, aes(x = age, y = charges)) +
  geom_point() +
  labs(title = "Relationship between Age and Charges",
       x = "Age",
       y = "Charges")
```

The scatterplot shows a moderate positive relationship between age and charges. For each age, the majority of clients are on the lower end of the charge range.
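
The strength of a linear relationship seen in a scatterplot can be summarized with the Pearson correlation coefficient. The sketch below uses hypothetical `age` and `charges` vectors (made-up values), so the resulting number says nothing about the insurance data itself.

```r
# Hypothetical values (made up for illustration)
age     <- c(20, 30, 40, 50, 60)
charges <- c(2000, 5000, 4000, 9000, 8000)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear trend
r <- cor(age, charges)
r
```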

Charges and Smoking
===

Column {data-width=500}
---

9. You should find that it seems "charges" could be classified into several groups. Let's create a scatterplot that has age as the independent variable (x) and has smoker as another categorical variable (color), and the response variable is charges. Comment on the scatterplot.
```{r}
ggplot(insurance, aes(x = age, y = charges, color = smoker)) +
  geom_point() +
  labs(title = "Relationship between Age, Charges, and Smoker Status",
       x = "Age",
       y = "Charges",
       color = "Smoker Status")
```

This plot shows that smokers have higher charges than non-smokers of the same age, and that no client in the lowest range of charges is a smoker.

Column {data-width=500}
---

10. Now, create two data frames by subsetting insurance data as follows.
`smoker <- insurance[insurance$smoker=="yes"]`
`nonsmoker <- insurance[insurance$smoker=="no"]`
```{r}
smoker <- insurance %>%
  filter(smoker == "yes")

nonsmoker <- insurance %>%
  filter(smoker == "no")
```
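
For comparison with the `filter()` calls, base R bracket subsetting selects rows only when the logical condition is followed by a comma; without the comma, `[` treats the vector as a column selector. A minimal sketch on a hypothetical data frame `toy` (made-up values, with hypothetical result names):

```r
# Hypothetical toy data standing in for the insurance data (values are made up)
toy <- data.frame(
  age    = c(19, 45, 33, 52),
  smoker = c("yes", "no", "no", "yes")
)

# The trailing comma means "these rows, all columns"
smoker_rows    <- toy[toy$smoker == "yes", ]
nonsmoker_rows <- toy[toy$smoker == "no", ]

nrow(smoker_rows)
nrow(nonsmoker_rows)
```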

Column {data-width=500}
---

11. Create a scatterplot that has age as the independent variable (x) and the response variable is charges using the data frame smoker. Then add the smooth line. Comment on the plot. Does it make sense to use the smooth line to summarize the relationship between age of clients and the corresponding charges? Why?
```{r}
ggplot(smoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Age and Charges for Smokers",
       x = "Age",
       y = "Charges")
```

I don't think a straight line summarizes this data well. Few of the points fall on or near the line, so it does not match the data closely. It does, however, show the positive association between age and charges.

Column {data-width=500}
---

12. Repeat Question 11 using the data frame nonsmoker.
```{r}
ggplot(nonsmoker, aes(x = age, y = charges)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Age and Charges for Non-Smokers",
       x = "Age",
       y = "Charges")
```

This time, the single straight line more accurately describes the data.
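
How well a straight line describes each group can be put on a common scale with R-squared from `lm()`. The sketch below uses two hypothetical data frames with made-up values, chosen so that one group lies close to its line and the other is widely scattered around it; the helper `r2()` is also hypothetical.

```r
# Hypothetical group that lies close to a straight line (made-up values)
nonsmoker_toy <- data.frame(
  age     = c(20, 30, 40, 50, 60),
  charges = c(3000, 4100, 4900, 6100, 7000)
)

# Hypothetical group scattered far above and below its fitted line
smoker_toy <- data.frame(
  age     = rep(c(20, 30, 40, 50, 60), times = 2),
  charges = c(15000, 16000, 17000, 18000, 19000,
              35000, 36000, 37000, 38000, 39000)
)

# Helper: share of the variance in charges explained by a linear fit on age
r2 <- function(df) summary(lm(charges ~ age, data = df))$r.squared

r2_nonsmoker <- r2(nonsmoker_toy)
r2_smoker    <- r2(smoker_toy)
c(nonsmoker = r2_nonsmoker, smoker = r2_smoker)
```

A low R-squared together with a positive slope matches the situation where the line captures the direction of the trend but few points sit near it.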

Children vs. No Children
===

Column {data-width=500}
---

13. Based on the finding you have on Questions 11 & 12, propose what you might do next if you want to model charges using other variables in this data.

Next, I would make similar plots of charges split by whether the client has children, or by region and bmi.
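
One possible next step, sketched here under stated assumptions, is a single linear model with more than one predictor. The data frame `toy` and its values are hypothetical, and the choice of `age` plus `smoker` as predictors is only an illustration of the approach.

```r
# Hypothetical toy data standing in for the insurance data (values are made up)
toy <- data.frame(
  age     = c(20, 30, 40, 50, 20, 30, 40, 50),
  smoker  = c("no", "no", "no", "no", "yes", "yes", "yes", "yes"),
  charges = c(2000, 3000, 4000, 5000, 20000, 21500, 22000, 23500)
)

# Model charges with age and smoker status together;
# lm() converts the character column smoker to a factor ("no" is the baseline)
fit <- lm(charges ~ age + smoker, data = toy)

# "smokeryes" estimates the shift in charges for smokers at a given age
coef(fit)
```

Other columns such as `bmi`, `region`, or `children` could be added to the formula in the same way.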

14. Create a pie chart of children. Use a few sentences to summarize your finding based on the plot. (Hint: You need to convert the variable to a categorical variable first)
```{r}
insurance_children <- insurance %>%
  mutate(children_grouped = case_when(
    children == 0 ~ "No Children",
    children > 0 ~ "Children"
  ))
ggplot(insurance_children, aes(x = "", fill = children_grouped)) + 
  geom_bar(width = 1, color = "white") +
  coord_polar("y", start = 0)

```

Column {data-width=500}
---

15. Create a boxplot that shows the distribution of charges based on the number of children. Discuss what you find based on the boxplot.
```{r}
# Treat the number of children as categorical so each count gets its own box
ggplot(insurance, aes(x = factor(children), y = charges)) +
  geom_boxplot() +
  labs(title = "Distribution of Charges Based on Number of Children",
       x = "Number of Children",
       y = "Charges")
```
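
`geom_boxplot()` draws one box per level of a discrete x variable, so a numeric count such as `children` is typically converted with `factor()` first. A minimal sketch of that conversion, using a hypothetical data frame `toy` with made-up values:

```r
# Hypothetical toy data standing in for the insurance data (values are made up)
toy <- data.frame(
  children = c(0, 0, 1, 1, 2, 2),
  charges  = c(2000, 4000, 3000, 9000, 5000, 7000)
)

# Convert the numeric count to a categorical variable
toy$children_f <- factor(toy$children)
levels(toy$children_f)

# Median charges per level: the value each box's center line would show
medians <- tapply(toy$charges, toy$children_f, median)
medians
```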